library(tidyverse)
library(knitr)
library(broom)
library(forcats)
library(stringr)
library(ggrepel)
library(modelr)
library(plotly)

library(socviz)

options(digits = 3)
set.seed(1234)

theme_set(theme_minimal())

Tables or graphs

  • Discuss articles
  • Debate when to use tables vs graphs
  • Paper vs. poster

Descriptive statistics/exploratory graphs

Smoothing lines

When examining multivariate continuous data, scatterplots are a quick and easy visualization to assess relationships. However if the data points become too densely clustered, interpreting the graph becomes difficult. Consider the diamonds dataset:

p <- ggplot(diamonds, aes(carat, price)) +
  geom_point() +
  scale_y_continuous(labels = scales::dollar) +
  labs(x = "Carat size",
       y = "Price")
p

What is the relationship between carat size and price? It appears positive, but there are also a lot of densely packed data points in the middle of the graph. Smoothing lines are a method for summarizing the relationship between variables to capture important patterns by approximating the functional form of the relationship. The functional form can take on many shapes. For instance, a very common functional form is a best-fit line, also known as ordinary least squares (OLS) or simple linear regression. We can estimate the model directly using lm(), or we can directly plot the line by using geom_smooth(method = "lm"):

p +
  geom_smooth(method = "lm", se = FALSE)

The downside to a linear best-fit line is that it assumes the relationship between the variables is additive and monotonic. Therefore the summarized relationship between carat size and price seems wildly incorrect for diamonds with a carat size larger than 3. Instead we could use a generalized additive model which allow for flexible, non-linear relationships between the variables while still implementing a basic regression approach:1

p +
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Locally weighted scatterplot smoothing (local regression, LOWESS, or LOESS) fits a separate non-linear function at each target point \(x_0\) using only the nearby training observations. This method estimates a regression line based on localized subsets of the data, building up the global function \(f\) point-by-point.

Here is an example of a local linear regression on the ethanol dataset in the lattice package:

The LOESS is built up point-by-point:

One important argument you can control with LOESS is the span, or how smooth the LOESS function will become. A larger span will result in a smoother curve, but may not be as accurate. A smaller span will be more local and wiggly, but improve our fit to the training data.

LOESS lines are best used for datasets with fewer than 1000 observations, otherwise the time and memory usage required to compute the line increases exponentially.

Show several smoothing lines at once with a legend

We can draw several smoothing lines at once on a plot by calling geom_smooth() multiple times. For instance, here we draw a graph with three smoothing lines:

  1. OLS
  2. Splines
  3. LOESS
library(gapminder)

ggplot(data = gapminder,
       mapping = aes(x = log(gdpPercap), y = lifeExp)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "lm") +
  geom_smooth(method = "lm", formula = y ~ splines::bs(x, df = 3)) +
  geom_smooth(method = "loess")

The problem is how can we distinguish one smoothing line from another? With the fill and color aesthetics. Rather than pass in a new variable, we manually label the fill and color aesthetics with character strings inside aes():

model_colors <- RColorBrewer::brewer.pal(3, "Set1")
model_colors
## [1] "#E41A1C" "#377EB8" "#4DAF4A"
ggplot(data = gapminder,
       mapping = aes(x = log(gdpPercap), y = lifeExp)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "lm", aes(color = "OLS", fill = "OLS")) +
  geom_smooth(method = "lm", formula = y ~ splines::bs(x, df = 3),
              aes(color = "Cubic Spline", fill = "Cubic Spline")) +
  geom_smooth(method = "loess",
              aes(color = "LOESS", fill = "LOESS")) +
  scale_color_manual(name = "Models", values = model_colors) +
  scale_fill_manual(name = "Models", values = model_colors) +
  theme(legend.position = "top")

By passing these in as character values, ggplot essentially creates a new variable for this mapping. We can then use scale_() functions to assign specific colors to those mappings.

Coefficient of correlation

  • Produces a measure of association, known as Pearson’s \(r\), that gauges the direction and strength of a relationship between two continuous variables
  • Scales between \(-1\) and \(+1\)
  • \(-1\) – perfect negative association between the variables
  • \(+1\) – perfect positive association between the variables
  • \(0\) – no relationship between the variables
  • Unit-less measure - no matter what scale the variables fall on (e.g. turnout, education, income), the number will always fall between \(-1\) and \(+1\)
r_plot <- function(r, n = 100){
  xy <- ecodist::corgen(len = n, r = r) %>%
    bind_cols
  
  ggplot(xy, aes(x, y)) +
    geom_point() +
    ggtitle(str_c("Pearson's r = ", r))
}

r <- c(.8, 0, -.8)

for(r in r){
  print(r_plot(r))
}

Scatterplot matricies

To quickly visualize several variables in a dataset and their relation to one another, a scatterplot matrix is a quick and detailed tool for generating a series of scatterplots for each combination of variables. Consider credit.csv which contains a sample of individuals from a credit card company, identifying their total amount of credit card debt and other financial/demographic variables:

credit <- read_csv("data/Credit.csv") %>%
  # remove first ID column
  select(-X1)
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   X1 = col_integer(),
##   Income = col_double(),
##   Limit = col_integer(),
##   Rating = col_integer(),
##   Cards = col_integer(),
##   Age = col_integer(),
##   Education = col_integer(),
##   Gender = col_character(),
##   Student = col_character(),
##   Married = col_character(),
##   Ethnicity = col_character(),
##   Balance = col_integer()
## )
names(credit) <- stringr::str_to_lower(names(credit))   # convert column names to lowercase
glimpse(credit)
## Observations: 400
## Variables: 11
## $ income    <dbl> 14.9, 106.0, 104.6, 148.9, 55.9, 80.2, 21.0, 71.4, 1...
## $ limit     <int> 3606, 6645, 7075, 9504, 4897, 8047, 3388, 7114, 3300...
## $ rating    <int> 283, 483, 514, 681, 357, 569, 259, 512, 266, 491, 58...
## $ cards     <int> 2, 3, 4, 3, 2, 4, 2, 2, 5, 3, 4, 3, 1, 1, 2, 3, 3, 3...
## $ age       <int> 34, 82, 71, 36, 68, 77, 37, 87, 66, 41, 30, 64, 57, ...
## $ education <int> 11, 15, 11, 11, 16, 10, 12, 9, 13, 19, 14, 16, 7, 9,...
## $ gender    <chr> "Male", "Female", "Male", "Female", "Male", "Male", ...
## $ student   <chr> "No", "Yes", "No", "No", "No", "No", "No", "No", "No...
## $ married   <chr> "Yes", "Yes", "No", "No", "Yes", "No", "No", "No", "...
## $ ethnicity <chr> "Caucasian", "Asian", "Asian", "Asian", "Caucasian",...
## $ balance   <int> 333, 903, 580, 964, 331, 1151, 203, 872, 279, 1350, ...

If we want to quickly assess the relationship between all of the variables (in preparation for more advanced statistical learning techniques), we could generate a matrix of scatterplots using the base graphics package:

pairs(select_if(credit, is.numeric))

  • This only works well if we use strictly quantitative variables (hence the use of select_if())
  • We don’t automatically see correlation information
  • It’s not built using ggplot2 so it’s hard to modify using techniques with which we are already familiar

Instead, we can use GGally::ggpairs() to generate a scatterplot matrix. GGally is a package for R that extends ggplot2 by adding helper functions for common multivariate data structures. ggpairs() is a function that allows us to quickly generate a scatterplot matrix.

library(GGally)
## 
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
## 
##     nasa
ggpairs(select_if(credit, is.numeric))

When applied to strictly numeric variables, the lower triangle generates scatterplots, the upper triangle prints the correlation coefficient, and the diagonal panels are density plots of the variable.

Because ggpairs() is ultimately based on ggplot(), we can use the same types of commands to modify the graph. For instance, if we want to use the color aesthetic to distinguish between men and women in the dataset:

ggpairs(credit, mapping = aes(color = gender),
        columns = c("income", "limit", "rating", "cards", "age", "education", "balance"))

Or if we wanted to draw a smoothing line instead of scatterplots, we can modify the graph’s matrix sections:

ggpairs(select_if(credit, is.numeric),
        lower = list(
          continuous = "smooth"
        )
)

Hmm, too difficult to see the smoothers because the points are so dense. We can use wrap() to pass through individual parameters to the underlying geom_():

ggpairs(select_if(credit, is.numeric),
        lower = list(
          continuous = wrap("smooth", alpha = .1, color = "blue")
        )
)

Or we can write a custom function and apply it to the lower triangle panels:

scatter_smooth <- function(data, mapping, ...) {
  ggplot(data = data, mapping = mapping) +
    # make data points transparent
    geom_point(alpha = .2) +
    # add default smoother
    geom_smooth(se = FALSE)
}

ggpairs(select_if(credit, is.numeric),
        lower = list(
          continuous = scatter_smooth
        )
)

ggpairs(credit, mapping = aes(color = gender),
        columns = c("income", "limit", "rating", "cards", "age", "education", "balance"),
        lower = list(
          continuous = scatter_smooth
        )
)

ggpairs() also works on datasets with a mix of qualitative and quantitative variables, drawing appropriate graphs based on whether the variables are continuous or discrete:

rcfss::scorecard %>%
  select(type:debt) %>%
  ggpairs

Heatmap of correlation coefficients

Scatterplot matricies can provide lots of information, but can also be very densely packed. Perhaps instead we want to quickly visualize the correlation between each of the variables.2 We can easily calculate the correlation coefficients using cor():

(mpg_lite <- select_if(mpg, is.numeric))
## # A tibble: 234 x 5
##    displ  year   cyl   cty   hwy
##    <dbl> <int> <int> <int> <int>
##  1  1.80  1999     4    18    29
##  2  1.80  1999     4    21    29
##  3  2.00  2008     4    20    31
##  4  2.00  2008     4    21    30
##  5  2.80  1999     6    16    26
##  6  2.80  1999     6    18    26
##  7  3.10  2008     6    18    27
##  8  1.80  1999     4    18    26
##  9  1.80  1999     4    16    25
## 10  2.00  2008     4    20    28
## # ... with 224 more rows
(cormat <- mpg_lite %>%
  cor %>%
  round(2))
##       displ  year   cyl   cty   hwy
## displ  1.00  0.15  0.93 -0.80 -0.77
## year   0.15  1.00  0.12 -0.04  0.00
## cyl    0.93  0.12  1.00 -0.81 -0.76
## cty   -0.80 -0.04 -0.81  1.00  0.96
## hwy   -0.77  0.00 -0.76  0.96  1.00

But who likes yucky tables. Instead let’s turn this into a heatmap. First we need to reshape the data into a tidy structure:

  1. Each row contains a single observation
  2. Each column contains a single variable
  3. Each cell contains a single value

What we need is a data frame with three columns:

  1. First variable name
  2. Second variable name
  3. Correlation coefficient

We can use reshape2::melt() to quickly accomplish this:

library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
(melted_cormat <- melt(cormat))
##     Var1  Var2 value
## 1  displ displ  1.00
## 2   year displ  0.15
## 3    cyl displ  0.93
## 4    cty displ -0.80
## 5    hwy displ -0.77
## 6  displ  year  0.15
## 7   year  year  1.00
## 8    cyl  year  0.12
## 9    cty  year -0.04
## 10   hwy  year  0.00
## 11 displ   cyl  0.93
## 12  year   cyl  0.12
## 13   cyl   cyl  1.00
## 14   cty   cyl -0.81
## 15   hwy   cyl -0.76
## 16 displ   cty -0.80
## 17  year   cty -0.04
## 18   cyl   cty -0.81
## 19   cty   cty  1.00
## 20   hwy   cty  0.96
## 21 displ   hwy -0.77
## 22  year   hwy  0.00
## 23   cyl   hwy -0.76
## 24   cty   hwy  0.96
## 25   hwy   hwy  1.00

We can then use geom_tile() to visualize the correlation matrix:

ggplot(melted_cormat, aes(x = Var1, y = Var2, fill = value)) + 
  geom_tile()

Not exactly pretty. We can clean it up first by reducing redundancy (remember the upper and lower triangles provide duplicate information):

# Get lower triangle of the correlation matrix
get_lower_tri<-function(cormat){
  cormat[upper.tri(cormat)] <- NA
  return(cormat)
}

# Get upper triangle of the correlation matrix
get_upper_tri <- function(cormat){
  cormat[lower.tri(cormat)]<- NA
  return(cormat)
}

upper_tri <- get_upper_tri(cormat)
upper_tri
##       displ year  cyl   cty   hwy
## displ     1 0.15 0.93 -0.80 -0.77
## year     NA 1.00 0.12 -0.04  0.00
## cyl      NA   NA 1.00 -0.81 -0.76
## cty      NA   NA   NA  1.00  0.96
## hwy      NA   NA   NA    NA  1.00

Now melt upper_tri and repeat the same process, cleaning up the colors for the heatmap as well to distinguish between positive and negative coefficients:

melted_cormat <- melt(upper_tri, na.rm = TRUE)

ggplot(melted_cormat, aes(Var2, Var1, fill = value))+
 geom_tile(color = "white") +
 scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
   midpoint = 0, limit = c(-1,1), space = "Lab", 
   name="Pearson\nCorrelation") +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, 
    size = 12, hjust = 1)) +
 coord_fixed()

We can also reorder the correlation matrix according to correlation coefficient to help reveal additional trends:

reorder_cormat <- function(cormat){
  # Use correlation between variables as distance
  dd <- as.dist((1-cormat)/2)
  hc <- hclust(dd)
  cormat <-cormat[hc$order, hc$order]
}

# Reorder the correlation matrix
cormat <- reorder_cormat(cormat)
upper_tri <- get_upper_tri(cormat)

# Melt the correlation matrix
melted_cormat <- melt(upper_tri, na.rm = TRUE)

# Create a ggheatmap
ggheatmap <- ggplot(melted_cormat, aes(Var2, Var1, fill = value))+
  geom_tile(color = "white")+
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
                       midpoint = 0, limit = c(-1,1), space = "Lab", 
                       name="Pearson\nCorrelation") +
  theme_minimal()+ # minimal theme
  theme(axis.text.x = element_text(angle = 45, vjust = 1, 
                                   size = 12, hjust = 1))+
  coord_fixed()

# Print the heatmap
print(ggheatmap)

Finally we can directly label the correlation coefficient values on the graph, so we have both the color channel and exact values:

ggheatmap + 
  geom_text(aes(Var2, Var1, label = value), color = "black", size = 4) +
  theme(
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.major = element_blank(),
    panel.border = element_blank(),
    panel.background = element_blank(),
    axis.ticks = element_blank(),
    legend.position = "bottom")

To make it more flexible, we can also turn all of this into a function that works for any dataset:

cormat_heatmap <- function(data){
  # generate correlation matrix
  cormat <- round(cor(data), 2)
  
  # melt into a tidy table
  get_upper_tri <- function(cormat){
    cormat[lower.tri(cormat)]<- NA
    return(cormat)
  }
  
  upper_tri <- get_upper_tri(cormat)
  
  # reorder matrix based on coefficient value
  reorder_cormat <- function(cormat){
    # Use correlation between variables as distance
    dd <- as.dist((1-cormat)/2)
    hc <- hclust(dd)
    cormat <-cormat[hc$order, hc$order]
  }
  
  cormat <- reorder_cormat(cormat)
  upper_tri <- get_upper_tri(cormat)
  
  # Melt the correlation matrix
  melted_cormat <- melt(upper_tri, na.rm = TRUE)
  
  # Create a ggheatmap
  ggheatmap <- ggplot(melted_cormat, aes(Var2, Var1, fill = value))+
    geom_tile(color = "white")+
    scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
                         midpoint = 0, limit = c(-1,1), space = "Lab", 
                         name="Pearson\nCorrelation") +
    theme_minimal()+ # minimal theme
    theme(axis.text.x = element_text(angle = 45, vjust = 1, 
                                     size = 12, hjust = 1))+
    coord_fixed()
  
  # add correlation values to graph
  ggheatmap + 
    geom_text(aes(Var2, Var1, label = value), color = "black", size = 4) +
    theme(
      axis.title.x = element_blank(),
      axis.title.y = element_blank(),
      panel.grid.major = element_blank(),
      panel.border = element_blank(),
      panel.background = element_blank(),
      axis.ticks = element_blank(),
      legend.position = "bottom")
}

cormat_heatmap(select_if(mpg, is.numeric))

cormat_heatmap(select_if(credit, is.numeric))

cormat_heatmap(select_if(diamonds, is.numeric))

Parallel coordinate plots

Parallel coordinate plots are an alternative graphical format for multivariate data analysis (continuous or discrete). They can be quite busy and messy. Key things for parallel coordinate plots:

  • Ordering variables in different ways helps to identify relevant patterns. Therefore a lot of this is trial and error
  • Adding interactivity (as we will see in later weeks) helps
ggparcoord(data = iris, columns = 1:4, groupColumn = 5)

# with the iris data, order the axes by overall class (Species) separation
# using the anyClass option
ggparcoord(data = iris, columns = 1:4, groupColumn = 5, order = "anyClass")

# add points to the plot, add a title, and use an alpha scalar to make the
# lines transparent
p <- ggparcoord(data = iris, columns = 1:4, groupColumn = 5, order = "anyClass", 
    showPoints = TRUE, title = "Parallel Coordinate Plot for the Iris Data", 
    alphaLines = 0.3)
p

# add some basic interactivity
ggplotly(p)

Three-dimensional plots

Adding a third (or fourth) dimension to a two-dimensional plot is relatively trivial when at least one of the variables is discrete:

ggplot(mpg, aes(displ, hwy, color = class)) +
  geom_point()

However what happens when you have three continuous dimensions to represent? Can we draw 3D graphs in R? Not easily, and interpreting 3D graphs can also be challenging mentally. ggplot2 cannot draw graphs in three dimensions. One possibility is to keep the data in two physical dimensions by using geom_tile() and adding the third dimension using the fill aesthetic (color channel). For example, say we estimate a logistic regression model of the probability of voting in the 1996 US presidential election and we want to visualize the predicted probability of survival for each combination of these variable:

# import data
(vote <- rcfss::mental_health)
## # A tibble: 1,317 x 5
##    vote96   age  educ female mhealth
##     <dbl> <dbl> <dbl>  <dbl>   <dbl>
##  1     1.   60.   12.     0.      0.
##  2     1.   36.   12.     0.      1.
##  3     0.   21.   13.     0.      7.
##  4     0.   29.   13.     0.      6.
##  5     1.   39.   18.     1.      2.
##  6     1.   41.   15.     1.      1.
##  7     1.   48.   20.     0.      2.
##  8     0.   20.   12.     1.      9.
##  9     0.   27.   11.     1.      9.
## 10     0.   34.    7.     1.      2.
## # ... with 1,307 more rows
# estimate model
vote_glm <- glm(vote96 ~ age + educ, data = vote, family = "binomial")
tidy(vote_glm)
##          term estimate std.error statistic  p.value
## 1 (Intercept)  -5.0244   0.44482     -11.3 1.38e-29
## 2         age   0.0469   0.00441      10.6 1.94e-26
## 3        educ   0.2816   0.02629      10.7 9.32e-27
# extract predicted probabilities
vote_prob <- vote %>%
  data_grid(age = 18:89, educ = 0:20) %>%
  modelr::add_predictions(vote_glm) %>%
  # convert predicted values to probabilities
  mutate(prob = rcfss::logit2prob(pred))

ggplot(vote_prob, aes(age, educ, fill = prob)) +
  geom_tile() +
  scale_fill_gradient2(midpoint = .5, label = scales::percent) +
  labs(title = "Probability of voter turnout in 1996",
       x = "Age",
       y = "Education (in years)",
       fill = "Probability\nof voting")

# cleaner image using geom_raster and interpolate = TRUE
ggplot(vote_prob, aes(age, educ, fill = prob)) +
  geom_raster(interpolate = TRUE) +
  scale_fill_gradient2(midpoint = .5, label = scales::percent) +
  labs(title = "Probability of voter turnout in 1996",
       x = "Age",
       y = "Education (in years)",
       fill = "Probability\nof voting")

plotly

If we wanted to represent this in true three-dimensional fashion, we could use plotly:

plot_ly(vote_prob, x = ~age, y = ~educ, z = ~prob) %>%
  add_mesh()
plot_ly(credit, x = ~limit, y = ~balance, z = ~income) %>%
  add_mesh()

3D surface plot

plot_ly(z = ~volcano) %>% add_surface()
volcano %>%
  melt %>%
  ggplot(aes(Var1, Var2, z = value)) +
  geom_contour(aes(color = ..level..))

Mosaic plot

What is the relationship between happiness and gender? We could identify this in several different contingency tables, depending on the probability distribution on which we wish to focus:

# Mosaic plot of happiness and education
library(productplots)
data("happy")

happy <- happy %>%
  na.omit

Joint distribution

  • \(f(\text{happy}, \text{sex})\)
# contingency table
library(descr)
crosstab(happy$happy, happy$sex, prop.t = TRUE, plot = FALSE)
##    Cell Contents 
## |-------------------------|
## |                   Count | 
## |           Total Percent | 
## |-------------------------|
## 
## =======================================
##                  happy$sex
## happy$happy       male   female   Total
## ---------------------------------------
## not too happy    1904     2460    4364 
##                   5.5%     7.1%        
## ---------------------------------------
## pretty happy     8760    10539   19299 
##                  25.2%    30.3%        
## ---------------------------------------
## very happy       4833     6327   11160 
##                  13.9%    18.2%        
## ---------------------------------------
## Total           15497    19326   34823 
## =======================================

Conditional distribution of sex given happiness

  • \(f(\text{sex} | \text{happy})\)
  • \(f(\text{happy} | \text{sex})\)
crosstab(happy$happy, happy$sex, prop.r = TRUE, plot = FALSE)
##    Cell Contents 
## |-------------------------|
## |                   Count | 
## |             Row Percent | 
## |-------------------------|
## 
## =======================================
##                  happy$sex
## happy$happy       male   female   Total
## ---------------------------------------
## not too happy    1904     2460    4364 
##                  43.6%    56.4%   12.5%
## ---------------------------------------
## pretty happy     8760    10539   19299 
##                  45.4%    54.6%   55.4%
## ---------------------------------------
## very happy       4833     6327   11160 
##                  43.3%    56.7%   32.0%
## ---------------------------------------
## Total           15497    19326   34823 
## =======================================
crosstab(happy$happy, happy$sex, prop.c = TRUE, plot = FALSE)
##    Cell Contents 
## |-------------------------|
## |                   Count | 
## |          Column Percent | 
## |-------------------------|
## 
## =======================================
##                  happy$sex
## happy$happy       male   female   Total
## ---------------------------------------
## not too happy    1904     2460    4364 
##                  12.3%    12.7%        
## ---------------------------------------
## pretty happy     8760    10539   19299 
##                  56.5%    54.5%        
## ---------------------------------------
## very happy       4833     6327   11160 
##                  31.2%    32.7%        
## ---------------------------------------
## Total           15497    19326   34823 
##                  44.5%    55.5%        
## =======================================

Conditional distribution of happiness given sex and marginal distribution of sex

  • \(f(\text{happy})\) and \(f(\text{sex})\)
crosstab(happy$happy, happy$sex, prop.c = TRUE, prop.r = TRUE, plot = FALSE)
##    Cell Contents 
## |-------------------------|
## |                   Count | 
## |             Row Percent | 
## |          Column Percent | 
## |-------------------------|
## 
## =======================================
##                  happy$sex
## happy$happy       male   female   Total
## ---------------------------------------
## not too happy    1904     2460    4364 
##                  43.6%    56.4%   12.5%
##                  12.3%    12.7%        
## ---------------------------------------
## pretty happy     8760    10539   19299 
##                  45.4%    54.6%   55.4%
##                  56.5%    54.5%        
## ---------------------------------------
## very happy       4833     6327   11160 
##                  43.3%    56.7%   32.0%
##                  31.2%    32.7%        
## ---------------------------------------
## Total           15497    19326   34823 
##                  44.5%    55.5%        
## =======================================

Each of the contingency tables encourages a different type of comparison, therefore the author has to decide in advance which comparison is most important and include the appropriate table. Alternatively, we can visualize this information using a mosaic plot, whereby the area of each rectangle is proportional to the number of observations falling into the respective contengencies.

There are a few different packages available for drawing mosaic plots in R.

graphics::mosaicplot()

mosaicplot(~ sex + happy, data = happy)

vcd::mosaic()

library(vcd)
## Loading required package: grid
## 
## Attaching package: 'vcd'
## The following objects are masked from 'package:productplots':
## 
##     mosaic, spine, tile
mosaic(~ happy + sex, happy)

productplots::prodplot()

# mosaic plot using productplots
prodplot(happy, ~ happy + sex)

# add color
prodplot(happy, ~ happy + sex) +
  aes(fill = happy) +
  theme(panel.grid = element_blank())

prodplot(happy, ~ happy + marital) +
  aes(fill = happy) +
  theme(legend.position = "none") +
  theme(panel.grid = element_blank())

Notice that the mosaic plot is very similar to a proportional bar chart:

ggplot(happy, aes(marital, fill = happy)) +
  geom_bar(position = "fill")

However unlike a proportional bar chart, the bar widths are constant and therefore we do not know what proportion of individuals in the survey are married vs. never married, or any other similar comparison.

Dot plot for summary statistics

library(ISLR)
## 
## Attaching package: 'ISLR'
## The following object is masked from 'package:vcd':
## 
##     Hitters
OJ_sum <- OJ %>%
  select(ends_with("MM"), ends_with("CH")) %>%
  gather(var, value) %>%
  group_by(var) %>%
  summarize(mean = mean(value),
            sd = sd(value),
            min = min(value),
            max = max(value),
            n = n())

# print the table
kable(OJ_sum)
var mean sd min max n
DiscCH 0.052 0.117 0.00 0.500 1070
DiscMM 0.123 0.214 0.00 0.800 1070
LoyalCH 0.566 0.308 0.00 1.000 1070
PctDiscCH 0.027 0.062 0.00 0.253 1070
PctDiscMM 0.059 0.102 0.00 0.402 1070
PriceCH 1.867 0.102 1.69 2.090 1070
PriceMM 2.085 0.134 1.69 2.290 1070
SalePriceCH 1.816 0.143 1.39 2.090 1070
SalePriceMM 1.962 0.253 1.19 2.290 1070
SpecialCH 0.148 0.355 0.00 1.000 1070
SpecialMM 0.162 0.368 0.00 1.000 1070
# plot using a single dot plot
ggplot(OJ_sum, aes(x = fct_reorder(var, mean), y = mean)) +
  geom_linerange(aes(ymin = mean - 2 * sd,
                      ymax = mean + 2 * sd),
                  linetype = 2,
                 size = .25) +
  geom_linerange(aes(ymin = mean - sd,
                      ymax = mean + sd),
                  size = 1) +
  geom_point() +
  coord_flip() +
  labs(x = NULL,
       y = NULL)

# dodge based on OJ brand
OJ_sum %>%
  separate(var, into = c("var", "brand"), -3, remove = TRUE) %>%
  ggplot(aes(x = fct_reorder(var, mean), y = mean, color = brand)) +
  geom_linerange(aes(ymin = mean - 2 * sd,
                      ymax = mean + 2 * sd),
                  linetype = 2,
                 size = .25,
                 position = position_dodge(width = 0.5)) +
  geom_linerange(aes(ymin = mean - sd,
                      ymax = mean + sd),
                  size = 1,
                 position = position_dodge(width = 0.5)) +
  geom_point(position = position_dodge(width = 0.5)) +
  coord_flip() +
  labs(x = NULL,
       y = NULL,
       color = "Brand")

# facet based on OJ brand
OJ_sum %>%
  separate(var, into = c("var", "brand"), -3, remove = TRUE) %>%
  ggplot(aes(x = fct_reorder(var, mean), y = mean)) +
  facet_grid(. ~ brand) +
  geom_linerange(aes(ymin = mean - 2 * sd,
                      ymax = mean + 2 * sd),
                  linetype = 2,
                 size = .25) +
  geom_linerange(aes(ymin = mean - sd,
                      ymax = mean + sd),
                  size = 1) +
  geom_point() +
  coord_flip() +
  labs(x = NULL,
       y = NULL,
       color = "Brand")

Model results

How to do it right

Designing good visualizations of models is tough because not only are you representing data, but also a specific statistical learning method used to summarize and structure the data. The more complex the model, the trickier this becomes. Not only do you have to estimate the model appropriately, you have to visually depict the results in an informative and not overly-complex manner. Some basic rules of thumb follow.

Present your findings in substantive terms

Show the results in context. If you are holding other variables constant, make sure they are sensible values. To show changes in continuous variables, move meaningfully across the distribution. For discrete variables, predicted values might be presented with respect to the modal or median value in the data.

Present the results in an interpretable scale. So if the modeling strategy produces log-odds, convert the estimates to probabilities so people can decipher the meaning of the results.

Show your degree of confidence

Don’t just focus on point estimates. Also account for your uncertainty surrounding the point estimates. Use functions such as geom_pointrange() and geom_errorbar() to visualize parameter estimates, or geom_ribbon() for line graphs.

Show your data when you can

  • Tables
    • Parameter estimates
    • Standard errors
    • Model statistics
  • Visualizations
    • Predicted values
    • Original data points

Extracting model contents

Recall the gapminder dataset, which includes measures of life expectancy over time for all countries in the world.

gapminder
## # A tibble: 1,704 x 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## # ... with 1,694 more rows

Let’s say we want to try and understand how life expectancy changes over time. We could visualize the data using a line graph:

gapminder %>% 
  ggplot(aes(year, lifeExp, group = country)) +
    geom_line(alpha = 1/3)

But this is incredibly noise. Why not estimate a simple linear model that summarizes this trend?

gapminder_mod <- lm(lifeExp ~ year, data = gapminder)
summary(gapminder_mod)
## 
## Call:
## lm(formula = lifeExp ~ year, data = gapminder)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -39.95  -9.65   1.70  10.33  22.16 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -585.6522    32.3140   -18.1   <2e-16 ***
## year           0.3259     0.0163    20.0   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.6 on 1702 degrees of freedom
## Multiple R-squared:  0.19,   Adjusted R-squared:  0.189 
## F-statistic:  399 on 1 and 1702 DF,  p-value: <2e-16
grid <- gapminder %>% 
  data_grid(year, country) 
grid
## # A tibble: 1,704 x 2
##     year country    
##    <int> <fct>      
##  1  1952 Afghanistan
##  2  1952 Albania    
##  3  1952 Algeria    
##  4  1952 Angola     
##  5  1952 Argentina  
##  6  1952 Australia  
##  7  1952 Austria    
##  8  1952 Bahrain    
##  9  1952 Bangladesh 
## 10  1952 Belgium    
## # ... with 1,694 more rows
grid <- grid %>% 
  add_predictions(gapminder_mod) 
grid
## # A tibble: 1,704 x 3
##     year country      pred
##    <int> <fct>       <dbl>
##  1  1952 Afghanistan  50.5
##  2  1952 Albania      50.5
##  3  1952 Algeria      50.5
##  4  1952 Angola       50.5
##  5  1952 Argentina    50.5
##  6  1952 Australia    50.5
##  7  1952 Austria      50.5
##  8  1952 Bahrain      50.5
##  9  1952 Bangladesh   50.5
## 10  1952 Belgium      50.5
## # ... with 1,694 more rows
ggplot(gapminder, aes(year, group = country)) +
  geom_line(aes(y = lifeExp), alpha = .2) +
  geom_line(aes(y = pred), data = grid, color = "red", size = 1)

So it appears that there is a positive trend - that is, over time life expectancy is rising. But we can also see a lot of variation in that trend - some countries are doing much better than others. We’ll come back to that in a bit.

Extracting model statistics

Model objects are not very pretty in R. Recall the complicated data structure I mentioned above:

str(gapminder_mod)
## List of 12
##  $ coefficients : Named num [1:2] -585.652 0.326
##   ..- attr(*, "names")= chr [1:2] "(Intercept)" "year"
##  $ residuals    : Named num [1:1704] -21.7 -21.8 -21.8 -21.4 -20.9 ...
##   ..- attr(*, "names")= chr [1:1704] "1" "2" "3" "4" ...
##  $ effects      : Named num [1:1704] -2455.1 232.2 -20.8 -20.5 -20.2 ...
##   ..- attr(*, "names")= chr [1:1704] "(Intercept)" "year" "" "" ...
##  $ rank         : int 2
##  $ fitted.values: Named num [1:1704] 50.5 52.1 53.8 55.4 57 ...
##   ..- attr(*, "names")= chr [1:1704] "1" "2" "3" "4" ...
##  $ assign       : int [1:2] 0 1
##  $ qr           :List of 5
##   ..$ qr   : num [1:1704, 1:2] -41.2795 0.0242 0.0242 0.0242 0.0242 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:1704] "1" "2" "3" "4" ...
##   .. .. ..$ : chr [1:2] "(Intercept)" "year"
##   .. ..- attr(*, "assign")= int [1:2] 0 1
##   ..$ qraux: num [1:2] 1.02 1.03
##   ..$ pivot: int [1:2] 1 2
##   ..$ tol  : num 1e-07
##   ..$ rank : int 2
##   ..- attr(*, "class")= chr "qr"
##  $ df.residual  : int 1702
##  $ xlevels      : Named list()
##  $ call         : language lm(formula = lifeExp ~ year, data = gapminder)
##  $ terms        :Classes 'terms', 'formula'  language lifeExp ~ year
##   .. ..- attr(*, "variables")= language list(lifeExp, year)
##   .. ..- attr(*, "factors")= int [1:2, 1] 0 1
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:2] "lifeExp" "year"
##   .. .. .. ..$ : chr "year"
##   .. ..- attr(*, "term.labels")= chr "year"
##   .. ..- attr(*, "order")= int 1
##   .. ..- attr(*, "intercept")= int 1
##   .. ..- attr(*, "response")= int 1
##   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. ..- attr(*, "predvars")= language list(lifeExp, year)
##   .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
##   .. .. ..- attr(*, "names")= chr [1:2] "lifeExp" "year"
##  $ model        :'data.frame':   1704 obs. of  2 variables:
##   ..$ lifeExp: num [1:1704] 28.8 30.3 32 34 36.1 ...
##   ..$ year   : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##   ..- attr(*, "terms")=Classes 'terms', 'formula'  language lifeExp ~ year
##   .. .. ..- attr(*, "variables")= language list(lifeExp, year)
##   .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
##   .. .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. .. ..$ : chr [1:2] "lifeExp" "year"
##   .. .. .. .. ..$ : chr "year"
##   .. .. ..- attr(*, "term.labels")= chr "year"
##   .. .. ..- attr(*, "order")= int 1
##   .. .. ..- attr(*, "intercept")= int 1
##   .. .. ..- attr(*, "response")= int 1
##   .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. .. ..- attr(*, "predvars")= language list(lifeExp, year)
##   .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
##   .. .. .. ..- attr(*, "names")= chr [1:2] "lifeExp" "year"
##  - attr(*, "class")= chr "lm"

In order to extract model statistics and use them in a tidy manner, we can use a set of functions from the broom package. For these functions, the input is always the model object itself, not the original data frame.

tidy()

tidy() constructs a data frame that summarizes the model’s statistical findings. This includes coefficients and p-values for each parameter in a regression model. Note that depending on the statistical learning method employed, the statistics stored in the columns may vary.

tidy(gapminder_mod)
##          term estimate std.error statistic  p.value
## 1 (Intercept) -585.652   32.3140     -18.1 2.90e-67
## 2        year    0.326    0.0163      20.0 7.55e-80
tidy(gapminder_mod) %>%
  str()
## 'data.frame':    2 obs. of  5 variables:
##  $ term     : chr  "(Intercept)" "year"
##  $ estimate : num  -585.652 0.326
##  $ std.error: num  32.314 0.0163
##  $ statistic: num  -18.1 20
##  $ p.value  : num  2.90e-67 7.55e-80

Notice that the structure of the resulting object is a tidy data frame. Every row contains a single parameter, every column contains a single statistic, and every cell contains exactly one value.

augment()

augment() adds columns to the original data that was modeled. This could include predictions, residuals, and other observation-level statistics.

augment(gapminder_mod) %>%
  as_tibble()
## # A tibble: 1,704 x 9
##    lifeExp  year .fitted .se.fit .resid     .hat .sigma .cooksd .std.resid
##      <dbl> <int>   <dbl>   <dbl>  <dbl>    <dbl>  <dbl>   <dbl>      <dbl>
##  1    28.8  1952    50.5   0.530  -21.7 0.00208    11.6 3.63e-3      -1.87
##  2    30.3  1957    52.1   0.463  -21.8 0.00158    11.6 2.79e-3      -1.88
##  3    32.0  1962    53.8   0.401  -21.8 0.00119    11.6 2.09e-3      -1.87
##  4    34.0  1967    55.4   0.348  -21.4 0.000895   11.6 1.51e-3      -1.84
##  5    36.1  1972    57.0   0.307  -20.9 0.000698   11.6 1.13e-3      -1.80
##  6    38.4  1977    58.7   0.285  -20.2 0.000599   11.6 9.07e-4      -1.74
##  7    39.9  1982    60.3   0.285  -20.4 0.000599   11.6 9.26e-4      -1.76
##  8    40.8  1987    61.9   0.307  -21.1 0.000698   11.6 1.15e-3      -1.81
##  9    41.7  1992    63.5   0.348  -21.9 0.000895   11.6 1.59e-3      -1.88
## 10    41.8  1997    65.2   0.401  -23.4 0.00119    11.6 2.42e-3      -2.01
## # ... with 1,694 more rows

augment() will only return statistics to the original data used to estimate the model, so it cannot be used like add_predictions() to generate predictions for new data.

glance()

glance() constructs a concise one-row summary of the model. This typically contains values such as \(R^2\), adjusted \(R^2\), and residual standard error that are computed once for the entire model.

glance(gapminder_mod)
##   r.squared adj.r.squared sigma statistic  p.value df logLik   AIC   BIC
## 1      0.19         0.189  11.6       399 7.55e-80  2  -6598 13202 13218
##   deviance df.residual
## 1   230229        1702

While broom may not work with every model in R, it is compatible with a wide range of common statistical models. A full list of models with which broom is compatible can be found on the GitHub page for the package.

stargazer

stargazer is a package for R to print summary statistics and model results in a tabular format. It can format tables using \(\LaTeX\), HTML, or ASCII text. Recall our original gapminder linear model:

tidy(gapminder_mod)
##          term estimate std.error statistic  p.value
## 1 (Intercept) -585.652   32.3140     -18.1 2.90e-67
## 2        year    0.326    0.0163      20.0 7.55e-80

Let’s print the results in a nicely formatted table:

library(stargazer)

stargazer(gapminder_mod, type = "html")
Dependent variable:
lifeExp
year 0.326***
(0.016)
Constant -586.000***
(32.300)
Observations 1,704
R2 0.190
Adjusted R2 0.189
Residual Std. Error 11.600 (df = 1702)
F Statistic 399.000*** (df = 1; 1702)
Note: p<0.1; p<0.05; p<0.01

stargazer also easily compiles multiple models into a single results table:

gapminder_yr <- lm(lifeExp ~ year, data = gapminder)
gapminder_gdp <- lm(lifeExp ~ gdpPercap, data = gapminder)
gapminder_yr_gdp <- lm(lifeExp ~ year + gdpPercap, data = gapminder)
gapminder_gdp_year_lifeexp <- lm(gdpPercap ~ year + lifeExp, data = gapminder)

stargazer(gapminder_yr, gapminder_gdp, gapminder_yr_gdp, gapminder_gdp_year_lifeexp,
          type = "html")
Dependent variable:
lifeExp gdpPercap
(1) (2) (3) (4)
year 0.326*** 0.239*** -19.000
(0.016) (0.014) (12.500)
gdpPercap 0.001*** 0.001***
(0.00003) (0.00002)
lifeExp 457.000***
(16.700)
Constant -586.000*** 54.000*** -418.000*** 17,658.000
(32.300) (0.315) (27.600) (24,287.000)
Observations 1,704 1,704 1,704 1,704
R2 0.190 0.341 0.437 0.342
Adjusted R2 0.189 0.340 0.437 0.341
Residual Std. Error 11.600 (df = 1702) 10.500 (df = 1702) 9.690 (df = 1701) 8,003.000 (df = 1701)
F Statistic 399.000*** (df = 1; 1702) 880.000*** (df = 1; 1702) 661.000*** (df = 2; 1701) 441.000*** (df = 2; 1701)
Note: p<0.1; p<0.05; p<0.01

All of the output is highly customizable within stargazer to avoid manual editing of the \(\LaTeX\) or HTML code.

stargazer(gapminder_yr, gapminder_gdp, gapminder_yr_gdp, gapminder_gdp_year_lifeexp,
          type = "html",
          dep.var.labels = c("Life expectancy", "GDP per capita"),
          covariate.labels = c("Year", "GDP per capita", "Life expectancy"),
          omit.stat = c("ser", "f"))
Dependent variable:
Life expectancy GDP per capita
(1) (2) (3) (4)
Year 0.326*** 0.239*** -19.000
(0.016) (0.014) (12.500)
GDP per capita 0.001*** 0.001***
(0.00003) (0.00002)
Life expectancy 457.000***
(16.700)
Constant -586.000*** 54.000*** -418.000*** 17,658.000
(32.300) (0.315) (27.600) (24,287.000)
Observations 1,704 1,704 1,704 1,704
R2 0.190 0.341 0.437 0.342
Adjusted R2 0.189 0.340 0.437 0.341
Note: p<0.1; p<0.05; p<0.01

Generate predicted values

Use the predict() function in base R. For instance, let’s build a multiple linear regression model of life expectancy.

out <- lm(formula = lifeExp ~ gdpPercap + pop + continent,
          data = gapminder)
summary(out)
## 
## Call:
## lm(formula = lifeExp ~ gdpPercap + pop + continent, data = gapminder)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -49.16  -4.49   0.30   5.11  25.17 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       4.78e+01   3.40e-01  140.82   <2e-16 ***
## gdpPercap         4.50e-04   2.35e-05   19.16   <2e-16 ***
## pop               6.57e-09   1.98e-09    3.33    9e-04 ***
## continentAmericas 1.35e+01   6.00e-01   22.46   <2e-16 ***
## continentAsia     8.19e+00   5.71e-01   14.34   <2e-16 ***
## continentEurope   1.75e+01   6.25e-01   27.97   <2e-16 ***
## continentOceania  1.81e+01   1.78e+00   10.15   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.37 on 1697 degrees of freedom
## Multiple R-squared:  0.582,  Adjusted R-squared:  0.581 
## F-statistic:  394 on 6 and 1697 DF,  p-value: <2e-16

We can use add_predictions() to generate predicted values. First we need to create a synthetic dataset of the observations for which we want to predict. If we focus on GDP per capita, we could hold population at its median value and calculate predicted values for all the continents. data_grid() automatically expands a data frame to generate all possible combinations of values.

min_gdp <- min(gapminder$gdpPercap)
max_gdp <- max(gapminder$gdpPercap)
med_pop <- median(gapminder$pop)

pred_df <- expand.grid(gdpPercap = (seq(from = min_gdp,
                                        to = max_gdp,
                                        length.out = 100)),
                       pop = med_pop,
                       continent = c("Africa", "Americas",
                                     "Asia", "Europe", "Oceania")) %>%
  as_tibble

pred_df
## # A tibble: 500 x 3
##    gdpPercap      pop continent
##        <dbl>    <dbl> <fct>    
##  1      241. 7023596. Africa   
##  2     1385. 7023596. Africa   
##  3     2530. 7023596. Africa   
##  4     3674. 7023596. Africa   
##  5     4818. 7023596. Africa   
##  6     5962. 7023596. Africa   
##  7     7107. 7023596. Africa   
##  8     8251. 7023596. Africa   
##  9     9395. 7023596. Africa   
## 10    10540. 7023596. Africa   
## # ... with 490 more rows

Now we can add the predictions to the data frame using predict():

pred_out <- predict(object = out,
                    newdata = pred_df,
                    interval = "predict") %>%
  as_tibble
pred_out
## # A tibble: 500 x 3
##      fit   lwr   upr
##    <dbl> <dbl> <dbl>
##  1  48.0  31.5  64.4
##  2  48.5  32.1  64.9
##  3  49.0  32.6  65.4
##  4  49.5  33.1  65.9
##  5  50.0  33.6  66.4
##  6  50.5  34.1  67.0
##  7  51.1  34.6  67.5
##  8  51.6  35.1  68.0
##  9  52.1  35.7  68.5
## 10  52.6  36.2  69.0
## # ... with 490 more rows

Because pred_df and pred_out correspond row for row, we can combine them back together.

pred_df <- bind_cols(pred_df, pred_out)
pred_df
## # A tibble: 500 x 6
##    gdpPercap      pop continent   fit   lwr   upr
##        <dbl>    <dbl> <fct>     <dbl> <dbl> <dbl>
##  1      241. 7023596. Africa     48.0  31.5  64.4
##  2     1385. 7023596. Africa     48.5  32.1  64.9
##  3     2530. 7023596. Africa     49.0  32.6  65.4
##  4     3674. 7023596. Africa     49.5  33.1  65.9
##  5     4818. 7023596. Africa     50.0  33.6  66.4
##  6     5962. 7023596. Africa     50.5  34.1  67.0
##  7     7107. 7023596. Africa     51.1  34.6  67.5
##  8     8251. 7023596. Africa     51.6  35.1  68.0
##  9     9395. 7023596. Africa     52.1  35.7  68.5
## 10    10540. 7023596. Africa     52.6  36.2  69.0
## # ... with 490 more rows

Now we have a tidy data frame we can use to plot the predict values for the ranges we specified.

ggplot(data = pred_df %>%
         filter(continent %in% c("Europe", "Africa")),
       aes(x = gdpPercap,
           y = fit, ymin = lwr, ymax = upr,
           color = continent,
           fill = continent,
           group = continent)) +
  geom_point(data = gapminder %>%
               filter(continent %in% c("Europe", "Africa")),
             aes(x = gdpPercap, y = lifeExp,
                 color = continent),
             alpha = 0.5,
             inherit.aes = FALSE) + 
  geom_line() +
  geom_ribbon(alpha = 0.2, color = FALSE) +
  scale_x_log10(labels = scales::dollar)

Plot marginal effects

Before we plotted the predicted relationship of GDP per capita and life expectancy, holding population constant at its median value. Basically, estimates of the effect of some coefficient net of the other terms in the model. Partial or marginal effects are frequently of substantive importance, especially in models with discrete outcomes. To obtain estimates of marginal effects, we can use the margins package.

library(margins)

Let’s look at an example of survey data from the 2012 presidential election by modeling votes for Obama as a function of political views, sex, and race, with an interaction between sex and race.

library(socviz)

glimpse(gss_sm)
## Observations: 2,867
## Variables: 32
## $ year        <dbl> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 20...
## $ id          <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,...
## $ ballot      <dbl+lbl> 1, 2, 3, 1, 3, 2, 1, 3, 1, 3, 2, 1, 2, 3, 2, 3...
## $ age         <dbl> 47, 61, 72, 43, 55, 53, 50, 23, 45, 71, 33, 86, 32...
## $ childs      <dbl> 3, 0, 2, 4, 2, 2, 2, 3, 3, 4, 5, 4, 3, 5, 7, 2, 6,...
## $ sibs        <dbl+lbl> 2, 3, 3, 3, 2, 2, 2, 6, 5, 1, 4, 4, 3, 6, 0, 1...
## $ degree      <fct> Bachelor, High School, Bachelor, High School, Grad...
## $ race        <fct> White, White, White, White, White, White, White, O...
## $ sex         <fct> Male, Male, Male, Female, Female, Female, Male, Fe...
## $ region      <fct> New England, New England, New England, New England...
## $ income16    <fct> $170000 or over, $50000 to 59999, $75000 to $89999...
## $ relig       <fct> None, None, Catholic, Catholic, None, None, None, ...
## $ marital     <fct> Married, Never Married, Married, Married, Married,...
## $ padeg       <fct> Graduate, Lt High School, High School, NA, Bachelo...
## $ madeg       <fct> High School, High School, Lt High School, High Sch...
## $ partyid     <fct> Independent, Ind,near Dem, Not Str Republican, Not...
## $ polviews    <fct> Moderate, Liberal, Conservative, Moderate, Slightl...
## $ happy       <fct> Pretty Happy, Pretty Happy, Very Happy, Pretty Hap...
## $ partners    <fct> NA, 1 Partner, 1 Partner, NA, 1 Partner, 1 Partner...
## $ grass       <fct> NA, Legal, Not Legal, NA, Legal, Legal, NA, Not Le...
## $ zodiac      <fct> Aquarius, Scorpio, Pisces, Cancer, Scorpio, Scorpi...
## $ pres12      <dbl+lbl> 3, 1, 2, 2, 1, 1, NA, NA, NA, 2, NA, NA, 1, 1,...
## $ wtssall     <dbl> 0.957, 0.478, 0.957, 1.914, 1.435, 0.957, 1.435, 0...
## $ income_rc   <fct> Gt $170000, Gt $50000, Gt $75000, Gt $170000, Gt $...
## $ agegrp      <fct> Age 45-55, Age 55-65, Age 65+, Age 35-45, Age 45-5...
## $ ageq        <fct> Age 34-49, Age 49-62, Age 62+, Age 34-49, Age 49-6...
## $ siblings    <fct> 2, 3, 3, 3, 2, 2, 2, 6+, 5, 1, 4, 4, 3, 6+, 0, 1, ...
## $ kids        <fct> 3, 0, 2, 4+, 2, 2, 2, 3, 3, 4+, 4+, 4+, 3, 4+, 4+,...
## $ religion    <fct> None, None, Catholic, Catholic, None, None, None, ...
## $ bigregion   <fct> Northeast, Northeast, Northeast, Northeast, Northe...
## $ partners_rc <fct> NA, 1, 1, NA, 1, 1, NA, 1, NA, 3, 1, NA, 1, NA, 0,...
## $ obama       <dbl> 0, 1, 0, 0, 1, 1, NA, NA, NA, 0, NA, NA, 1, 1, 0, ...
# make moderates the reference category (i.e. 0)
gss_sm$polviews_m <- relevel(gss_sm$polviews, ref = "Moderate")

# build the model
out_bo <- glm(obama ~ polviews_m + sex*race,
              family = "binomial", data = gss_sm)
summary(out_bo)
## 
## Call:
## glm(formula = obama ~ polviews_m + sex * race, family = "binomial", 
##     data = gss_sm)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.905  -0.554   0.177   0.542   2.244  
## 
## Coefficients:
##                                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                       0.29649    0.13409    2.21   0.0270 *  
## polviews_mExtremely Liberal       2.37295    0.52504    4.52  6.2e-06 ***
## polviews_mLiberal                 2.60003    0.35667    7.29  3.1e-13 ***
## polviews_mSlightly Liberal        1.29317    0.24843    5.21  1.9e-07 ***
## polviews_mSlightly Conservative  -1.35528    0.18129   -7.48  7.7e-14 ***
## polviews_mConservative           -2.34746    0.20038  -11.71  < 2e-16 ***
## polviews_mExtremely Conservative -2.72738    0.38721   -7.04  1.9e-12 ***
## sexFemale                         0.25487    0.14537    1.75   0.0796 .  
## raceBlack                         3.84953    0.50132    7.68  1.6e-14 ***
## raceOther                        -0.00214    0.43576    0.00   0.9961    
## sexFemale:raceBlack              -0.19751    0.66007   -0.30   0.7648    
## sexFemale:raceOther               1.57483    0.58766    2.68   0.0074 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2247.9  on 1697  degrees of freedom
## Residual deviance: 1345.9  on 1686  degrees of freedom
##   (1169 observations deleted due to missingness)
## AIC: 1370
## 
## Number of Fisher Scoring iterations: 6

We could graph the raw parameters, but they are in the log-odds form and not directly intuitive. We can use margins() to calculate the marginal effects for each variable:

bo_m <- margins(out_bo)
## Warning in warn_for_weights(model): 'weights' used in model estimation are
## currently ignored!
summary(bo_m)
##                            factor     AME     SE        z      p   lower
##            polviews_mConservative -0.4119 0.0283 -14.5394 0.0000 -0.4674
##  polviews_mExtremely Conservative -0.4538 0.0420 -10.7971 0.0000 -0.5361
##       polviews_mExtremely Liberal  0.2681 0.0295   9.0996 0.0000  0.2103
##                 polviews_mLiberal  0.2768 0.0229  12.0736 0.0000  0.2319
##   polviews_mSlightly Conservative -0.2658 0.0330  -8.0596 0.0000 -0.3304
##        polviews_mSlightly Liberal  0.1933 0.0303   6.3896 0.0000  0.1340
##                         raceBlack  0.4032 0.0173  23.3568 0.0000  0.3694
##                         raceOther  0.1247 0.0386   3.2297 0.0012  0.0490
##                         sexFemale  0.0443 0.0177   2.5073 0.0122  0.0097
##    upper
##  -0.3564
##  -0.3714
##   0.3258
##   0.3218
##  -0.2011
##   0.2526
##   0.4371
##   0.2005
##   0.0789

margins includes some built-in plotting mechanisms using base graphics. To build plots in ggplot(), we need to convert this summary object to a data frame and clean up the labels.

bo_gg <- as_tibble(summary(bo_m))
prefixes <- c("polviews_m", "sex")
bo_gg$factor <- prefix_strip(bo_gg$factor, prefixes)
bo_gg$factor <- prefix_replace(bo_gg$factor, "race", "Race: ")

bo_gg %>%
  select(factor, AME, lower, upper) 
## # A tibble: 9 x 4
##   factor                     AME    lower   upper
## * <chr>                    <dbl>    <dbl>   <dbl>
## 1 Conservative           -0.412  -0.467   -0.356 
## 2 Extremely Conservative -0.454  -0.536   -0.371 
## 3 Extremely Liberal       0.268   0.210    0.326 
## 4 Liberal                 0.277   0.232    0.322 
## 5 Slightly Conservative  -0.266  -0.330   -0.201 
## 6 Slightly Liberal        0.193   0.134    0.253 
## 7 Race: Black             0.403   0.369    0.437 
## 8 Race: Other             0.125   0.0490   0.200 
## 9 Female                  0.0443  0.00967  0.0789

And now graph using skills we already have:

ggplot(data = bo_gg, aes(x = reorder(factor, AME),
                         y = AME, ymin = lower, ymax = upper)) +
  geom_hline(yintercept = 0, color = "gray80") +
  geom_pointrange() + coord_flip() +
  labs(x = NULL, y = "Average Marginal Effect") 

Note that as the average marginal effect, this is the average effect of each variable for all the respondents in the dataset.

Session Info

devtools::session_info()
## Session info -------------------------------------------------------------
##  setting  value                       
##  version  R version 3.4.3 (2017-11-30)
##  system   x86_64, darwin15.6.0        
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  tz       America/Chicago             
##  date     2018-04-24
## Packages -----------------------------------------------------------------
##  package     * version    date       source                             
##  assertthat    0.2.0      2017-04-11 CRAN (R 3.4.0)                     
##  backports     1.1.2      2017-12-13 CRAN (R 3.4.3)                     
##  base        * 3.4.3      2017-12-07 local                              
##  bindr         0.1.1      2018-03-13 CRAN (R 3.4.3)                     
##  bindrcpp      0.2.2.9000 2018-04-08 Github (krlmlr/bindrcpp@bd5ae73)   
##  broom       * 0.4.4      2018-03-29 CRAN (R 3.4.3)                     
##  cellranger    1.1.0      2016-07-27 CRAN (R 3.4.0)                     
##  cli           1.0.0      2017-11-05 CRAN (R 3.4.2)                     
##  colorspace    1.3-2      2016-12-14 CRAN (R 3.4.0)                     
##  compiler      3.4.3      2017-12-07 local                              
##  crayon        1.3.4      2017-10-03 Github (gaborcsardi/crayon@b5221ab)
##  data.table    1.10.4-3   2017-10-27 CRAN (R 3.4.2)                     
##  datasets    * 3.4.3      2017-12-07 local                              
##  devtools      1.13.5     2018-02-18 CRAN (R 3.4.3)                     
##  digest        0.6.15     2018-01-28 CRAN (R 3.4.3)                     
##  dplyr       * 0.7.4.9003 2018-04-08 Github (tidyverse/dplyr@b7aaa95)   
##  evaluate      0.10.1     2017-06-24 CRAN (R 3.4.1)                     
##  forcats     * 0.3.0      2018-02-19 CRAN (R 3.4.3)                     
##  foreign       0.8-69     2017-06-22 CRAN (R 3.4.3)                     
##  ggplot2     * 2.2.1.9000 2018-04-24 Github (tidyverse/ggplot2@3c9c504) 
##  ggrepel     * 0.7.0      2017-09-29 CRAN (R 3.4.2)                     
##  glue          1.2.0      2017-10-29 CRAN (R 3.4.2)                     
##  graphics    * 3.4.3      2017-12-07 local                              
##  grDevices   * 3.4.3      2017-12-07 local                              
##  grid          3.4.3      2017-12-07 local                              
##  gtable        0.2.0      2016-02-26 CRAN (R 3.4.0)                     
##  haven         1.1.1      2018-01-18 CRAN (R 3.4.3)                     
##  hms           0.4.2      2018-03-10 CRAN (R 3.4.3)                     
##  htmltools     0.3.6      2017-04-28 CRAN (R 3.4.0)                     
##  htmlwidgets   1.0        2018-01-20 CRAN (R 3.4.3)                     
##  httr          1.3.1      2017-08-20 CRAN (R 3.4.1)                     
##  jsonlite      1.5        2017-06-01 CRAN (R 3.4.0)                     
##  knitr       * 1.20       2018-02-20 CRAN (R 3.4.3)                     
##  lattice       0.20-35    2017-03-25 CRAN (R 3.4.3)                     
##  lazyeval      0.2.1      2017-10-29 CRAN (R 3.4.2)                     
##  lubridate     1.7.4      2018-04-11 CRAN (R 3.4.3)                     
##  magrittr      1.5        2014-11-22 CRAN (R 3.4.0)                     
##  memoise       1.1.0      2017-04-21 CRAN (R 3.4.0)                     
##  methods     * 3.4.3      2017-12-07 local                              
##  mnormt        1.5-5      2016-10-15 CRAN (R 3.4.0)                     
##  modelr      * 0.1.1      2017-08-10 local                              
##  munsell       0.4.3      2016-02-13 CRAN (R 3.4.0)                     
##  nlme          3.1-137    2018-04-07 CRAN (R 3.4.4)                     
##  parallel      3.4.3      2017-12-07 local                              
##  pillar        1.2.1      2018-02-27 CRAN (R 3.4.3)                     
##  pkgconfig     2.0.1      2017-03-21 CRAN (R 3.4.0)                     
##  plotly      * 4.7.1      2017-07-29 CRAN (R 3.4.1)                     
##  plyr          1.8.4      2016-06-08 CRAN (R 3.4.0)                     
##  psych         1.8.3.3    2018-03-30 CRAN (R 3.4.4)                     
##  purrr       * 0.2.4      2017-10-18 CRAN (R 3.4.2)                     
##  R6            2.2.2      2017-06-17 CRAN (R 3.4.0)                     
##  Rcpp          0.12.16    2018-03-13 CRAN (R 3.4.4)                     
##  readr       * 1.1.1      2017-05-16 CRAN (R 3.4.0)                     
##  readxl        1.0.0      2017-04-18 CRAN (R 3.4.0)                     
##  reshape2      1.4.3      2017-12-11 CRAN (R 3.4.3)                     
##  rlang         0.2.0.9001 2018-04-24 Github (r-lib/rlang@82b2727)       
##  rmarkdown     1.9        2018-03-01 CRAN (R 3.4.3)                     
##  rprojroot     1.3-2      2018-01-03 CRAN (R 3.4.3)                     
##  rstudioapi    0.7        2017-09-07 CRAN (R 3.4.1)                     
##  rvest         0.3.2      2016-06-17 CRAN (R 3.4.0)                     
##  scales        0.5.0.9000 2018-04-24 Github (hadley/scales@d767915)     
##  socviz      * 0.3.0      2018-04-08 Github (kjhealy/socviz@a0e00ac)    
##  stats       * 3.4.3      2017-12-07 local                              
##  stringi       1.1.7      2018-03-12 CRAN (R 3.4.3)                     
##  stringr     * 1.3.0      2018-02-19 CRAN (R 3.4.3)                     
##  tibble      * 1.4.2      2018-01-22 CRAN (R 3.4.3)                     
##  tidyr       * 0.8.0      2018-01-29 CRAN (R 3.4.3)                     
##  tidyselect    0.2.4      2018-02-26 CRAN (R 3.4.3)                     
##  tidyverse   * 1.2.1      2017-11-14 CRAN (R 3.4.2)                     
##  tools         3.4.3      2017-12-07 local                              
##  utils       * 3.4.3      2017-12-07 local                              
##  viridisLite   0.3.0      2018-02-01 CRAN (R 3.4.3)                     
##  withr         2.1.2      2018-04-24 Github (jimhester/withr@79d7b0d)   
##  xml2          1.2.0      2018-01-24 CRAN (R 3.4.3)                     
##  yaml          2.1.18     2018-03-08 CRAN (R 3.4.4)

  1. geom_smooth() automatically implements the gam method for datasets with greater than 1000 observations.

  2. Example drawn from ggplot2 : Quick correlation matrix heatmap - R software and data visualization.